Collaboration

No one outside the group

Exploratory Data Analysis

Information on the competition can be found here.

First, to get a general idea of the dataset, let’s make a correlation plot to see the relationships between the variables. Additionally the correlation values of the 8 test variables in relation to the count variable.

##                    [,1]
## season      0.163439017
## holiday    -0.005392984
## workingday  0.011593866
## weather    -0.128655201
## temp        0.394453645
## atemp       0.389784437
## humidity   -0.317371479
## windspeed   0.101369470

It appears that none of the continuous variables have a strikingly clear relationship with the target variable. There appears to be useful explainitory information in some of the factor variables, but largely only in the upper range of the data (see box plots)

Then, let’s look at the spread of the response variable, ‘count’.

ggplot(data = train, aes(x = count)) + geom_density(adjust = 0.77) + labs(title = "Spread ")

fav_stats(train$count)
min Q1 median Q3 max mean sd n missing
1 42 145 284 977 191.5741 181.1445 10886 0

Additionally we can use stepwise regression below to see which predictors are best using only 3 steps. Note VOI stands for “variables of interest” referring to predictors used.

train_VOI <- train %>% 
  select(-casual) %>% 
  select(-registered, -datetime)

null <- lm(count ~ 1, data = train_VOI)
full <- lm(count ~ ., data = train_VOI)
step(null, scope = list(lower = null, upper = full), direction = "both", steps = 3)
## Start:  AIC=113200.1
## count ~ 1
## 
##              Df Sum of Sq       RSS    AIC
## + temp        1  55573847 301599066 111361
## + atemp       1  54265962 302906952 111408
## + humidity    1  35976119 321196795 112046
## + season      1   9540914 347631999 112907
## + weather     1   5911983 351260930 113020
## + windspeed   1   3670227 353502687 113090
## <none>                    357172914 113200
## + workingday  1     48010 357124903 113201
## + holiday     1     10388 357162526 113202
## 
## Step:  AIC=111361
## count ~ temp
## 
##              Df Sum of Sq       RSS    AIC
## + humidity    1  30531115 271067951 110201
## + windspeed   1   4199192 297399874 111210
## + weather     1   4097579 297501488 111214
## + season      1   1443023 300156043 111311
## <none>                    301599066 111361
## + atemp       1     19223 301579843 111362
## + holiday     1     10841 301588226 111363
## + workingday  1        18 301599048 111363
## - temp        1  55573847 357172914 113200
## 
## Step:  AIC=110201.1
## count ~ temp + humidity
## 
##              Df Sum of Sq       RSS    AIC
## + season      1   5990445 265077505 109960
## + atemp       1    638266 270429685 110177
## + windspeed   1     86319 270981631 110200
## + weather     1     52104 271015847 110201
## <none>                    271067951 110201
## + holiday     1      8711 271059240 110203
## + workingday  1      2891 271065060 110203
## - humidity    1  30531115 301599066 111361
## - temp        1  50128844 321196795 112046
## 
## Step:  AIC=109959.9
## count ~ temp + humidity + season
## 
## Call:
## lm(formula = count ~ temp + humidity + season, data = train_VOI)
## 
## Coefficients:
## (Intercept)         temp     humidity       season  
##     164.051        7.859       -3.027       22.280

The results suggest that temperature, humidity and season are the most influential predictors. This doesn’t guarantee that this is the ideal model however so it is still possible to look at other predictors.

We can see the relationship between our response variable and three predictors we selected, to see how bike rentals are impacted by Temperature, Humidity, and whether the day in question is a holiday.

train$holiday1<-0
train$holiday1[which(train$holiday == 0)] <- 'Not Holiday'
train$holiday1[which(train$holiday == 1)] <- 'Holiday'
train$holiday <- as.factor(train$holiday)
p <- plot_ly(train, x = ~temp, y = ~humidity, z = ~count, color= ~holiday1, colors = c('#BF382A', '#0C4B8E')) %>%
  add_markers() %>%
  layout(scene = list(xaxis = list(title = 'Humidity'),
                     yaxis = list(title = 'Temperature (Celsius)'),
                     zaxis = list(title = 'Count')))
ggplotly(p)

Model Fit

We are picking a model which uses Windspeed, temperature, and Humidity as the three predictors to predict Count with. JUSTIFICATION:

In the correlation plot above, these 3 variables had the strongest apparent relationship to “count”. Windspeed was used instead of adjusted temperature to avoid obvious covariance between temp and atemp.

The following represents the model being applied to the test set via predict():

train <- read_csv("data/train.csv")
model1<- lm(log(count)~ windspeed+temp+humidity, data=train)
summary(model1)
## 
## Call:
## lm(formula = log(count) ~ windspeed + temp + humidity, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.5604 -0.6476  0.2639  0.8750  3.2963 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.511861   0.065391  68.998  < 2e-16 ***
## windspeed    0.005989   0.001616   3.706 0.000212 ***
## temp         0.067915   0.001609  42.216  < 2e-16 ***
## humidity    -0.022783   0.000687 -33.162  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.304 on 10882 degrees of freedom
## Multiple R-squared:  0.2334, Adjusted R-squared:  0.2332 
## F-statistic:  1104 on 3 and 10882 DF,  p-value: < 2.2e-16
mplot(model1)[1:2]
## [[1]]

## 
## [[2]]

Test_predict <- predict(model1, newdata = test)

Test_predict_table <- data_frame(Test_predict) %>% 
  mutate(count = exp(Test_predict)) %>% 
  mutate(datetime = sample_submission$datetime) %>%
  select(datetime, count)

All of the predictors in this model are significant. That said, most variants of this model (including when we log and/or take the square roots) violate the equal variance and normality of error conditions.

The following code is intended to be used in re-submission for cross validation

set.seed(3400)
nfold <- 5
nrow(train)/nfold
## [1] 2177.2
k <- 5
Fold_train <- train %>% 
  sample_frac(1) %>% 
  mutate(fold = rep(1:k, len=nrow(train)))

"indices <- sample(nrow(train), nrow(train), replace=FALSE)
splits<-split(indices, ceiling(seq_along(indices)/2177))
splits<-as.data.frame(splits)
d1a <- train[splits$X1,]
d1b <- train[splits$X2,]
d1c <- train[splits$X3,]
d1d <- train[splits$X4,]
d1e <- train[splits$X5,]
folds <- list[d1a,d1b,d1c,d1d,d1e]
rmse1<-0
rmse2<-0
rmse3<-0
rmse4<-0
rmse5<-0
rmses <- c(rmse1,rmse2,rmse3,rmse4,rmse5)"
## [1] "indices <- sample(nrow(train), nrow(train), replace=FALSE)\nsplits<-split(indices, ceiling(seq_along(indices)/2177))\nsplits<-as.data.frame(splits)\nd1a <- train[splits$X1,]\nd1b <- train[splits$X2,]\nd1c <- train[splits$X3,]\nd1d <- train[splits$X4,]\nd1e <- train[splits$X5,]\nfolds <- list[d1a,d1b,d1c,d1d,d1e]\nrmse1<-0\nrmse2<-0\nrmse3<-0\nrmse4<-0\nrmse5<-0\nrmses <- c(rmse1,rmse2,rmse3,rmse4,rmse5)"
"for (i in 1:5){
  rmses[i-1] <- lm(count~temp~workingday+weather, data = subset(Fold_train, fold = i)) %>% 
    predict(subset(Fold_train, fold != i)) %>% 
    `-` (subset(Fold_train, fold != i)$count) %>% 
    `^`(2) %>% 
    sum %>% 
   `/` (subset(Fold_train, fold != i)$count) %>% 
    sqrt
}"
## [1] "for (i in 1:5){\n  rmses[i-1] <- lm(count~temp~workingday+weather, data = subset(Fold_train, fold = i)) %>% \n    predict(subset(Fold_train, fold != i)) %>% \n    `-` (subset(Fold_train, fold != i)$count) %>% \n    `^`(2) %>% \n    sum %>% \n   `/` (subset(Fold_train, fold != i)$count) %>% \n    sqrt\n}"

Create Submission File

write.csv(Test_predict_table, "submission.csv", row.names = F)

FINAL CHECK

. every plot has all labels (including units) . code chunks are hidden in the output . our names + date etc are all up there